
    Visual and computational analysis of structure-activity relationships in high-throughput screening data

    Novel analytic methods are required to assimilate the large volumes of structural and bioassay data generated by combinatorial chemistry and high-throughput screening programmes in the pharmaceutical and agrochemical industries. This paper reviews recent work in visualisation and data mining that can be used to develop structure-activity relationships from such chemical/biological datasets.

    Reserve replacement in the oil and gas industry -A study on cost differences

    Background and question at issue: When attempting to increase its oil and gas reserves, an oil and gas company typically has two options: to prospect for and develop reserves itself, or to acquire reserves through a takeover of another company with proved reserves. In this thesis, the reader will find an approximate answer to the question of which of these alternatives is the more cost-effective. Purpose: The thesis was written with the intent to fill some of the gaps in the academic literature regarding the cost-effectiveness of increasing oil and gas reserves. The paper also contains a discussion of the determinants of value and cost in the oil and gas industry, intended to help illuminate the economic dynamics of the industry. Delimitations: Some generalizations have been made in this thesis in order to improve the transparency and clarity of the study. Methodology: The thesis builds on a thorough study of relevant academic research and empirical data, e.g. annual reports and press releases. The costs and outcomes of exploration activities between 2009 and 2013 were gathered from the ten largest oil and gas companies by market capitalization as of 2015-04-15 and compared with eight acquisitions that were considered appropriate. Conclusions: The findings of this study indicate that, from a strictly economic perspective, prospecting for and developing oil and gas reserves is the more cost-effective way to increase reserves, although the findings are not statistically significant. Possible stakeholders: This thesis should be of interest to anyone with a particular interest in the oil and gas industry.

    Exploiting QSAR models in lead optimization.

    QSAR models can play a vital role in both the opening phase and the endgame of lead optimization. In the opening phase, there is often a large quantity of data from high-throughput screening (HTS), and potential leads need to be selected from several distinct chemotypes. In the endgame, the throughput of the final, critical ADMET and pharmacokinetic assays is often not sufficient to allow full experimental characterization of all the structures in the available time. A considerable amount of current research toward new QSAR models is based on modeling general ADMET phenomena, with the aim of constructing globally applicable models. The process of constructing QSAR models is relatively straightforward; however, it is also simple to build misleading, or even incorrect, models. This review considers the key developments in the field of QSAR modeling: how QSAR models are constructed, how they can be validated, their reliability, and their applicability. If applied carefully and appropriately, the QSAR technique has a valuable role to play during lead optimization.
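    The review's central point, that QSAR models are easy to build but also easy to build badly, can be illustrated with a generic example. The sketch below is not taken from the review; it uses synthetic descriptors and activities and assumes scikit-learn is available, contrasting the over-optimistic resubstitution fit of a simple regression QSAR model with its cross-validated performance.

```python
# Minimal sketch (not from the review): fitting and cross-validating a simple
# QSAR regression model. X and y are synthetic stand-ins; in practice the
# descriptors would come from a package such as RDKit.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                               # hypothetical descriptors for 200 compounds
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=200)   # hypothetical activities

model = Ridge(alpha=1.0)

# Fit on all data: the resubstitution R^2 is almost always over-optimistic.
train_r2 = model.fit(X, y).score(X, y)

# Cross-validated R^2 gives a more honest estimate of predictive quality.
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(f"R^2 (training fit): {train_r2:.2f}   R^2 (5-fold CV): {cv_r2:.2f}")
```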

    Leave-cluster-out crossvalidation is appropriate for Scoring Functions derived on diverse protein datasets

    With the emergence of large collections of protein-ligand complexes complemented by binding data, as found in PDBbind or BindingMOAD, new opportunities for parameterizing and evaluating scoring functions arise. With such large data collections available, it becomes feasible to fit scoring functions in a QSAR style, i.e. by defining protein-ligand interaction descriptors and analyzing them with modern machine-learning methods. As in any data-modelling approach, care has to be taken to validate the model carefully. Here we show that there are large differences, measured in R (0.77 vs. 0.46) or R² (0.59 vs. 0.21), for a relatively simple scoring function depending on whether it is validated against the PDBbind core set or in a leave-cluster-out cross-validation. If proteins from the same family are present in both the training and the validation set, the prediction quality estimated by standard validation techniques looks too optimistic.
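    As an illustration of the validation effect described above, the following sketch contrasts a random k-fold split with a leave-cluster-out split in which each protein family is held out as a whole. The data, descriptors, and family labels are synthetic placeholders rather than the PDBbind complexes used in the paper, and scikit-learn's LeaveOneGroupOut splitter is assumed for the grouped split.

```python
# Hedged sketch: when samples cluster by protein family, a random split lets
# family members leak between training and test sets, while a leave-cluster-out
# split does not. All data here are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
n_families, per_family = 20, 15
families = np.repeat(np.arange(n_families), per_family)

# Synthetic descriptors with a family-specific offset, mimicking the fact that
# members of one protein family resemble each other.
X = rng.normal(size=(n_families * per_family, 30)) + families[:, None] * 0.3
y = X[:, 0] + 2.0 * families + rng.normal(scale=1.0, size=len(families))

model = Ridge(alpha=1.0)
r2_random = cross_val_score(model, X, y, scoring="r2",
                            cv=KFold(5, shuffle=True, random_state=0)).mean()
r2_cluster = cross_val_score(model, X, y, scoring="r2",
                             cv=LeaveOneGroupOut(), groups=families).mean()
# The random-split estimate is typically much higher than the cluster-out one.
print(f"random 5-fold R^2: {r2_random:.2f}   leave-cluster-out R^2: {r2_cluster:.2f}")
```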

    Three descriptor model can predict 55% of the CSAR-NRC HiQ benchmark dataset

    Here we report the results we obtained with a proteochemometric approach for predicting ligand binding free energies of the CSAR-NRC HiQ benchmark data set. Using distance-dependent atom-type pair descriptors in a bagged stepwise multiple linear regression (MLR) model with subsequent complexity reduction, we were able to identify three descriptors that can be used to build a very robust regression model for the CSAR-NRC HiQ data set. The model has an R²(cv) of 0.55, an MUE(cv) of 1.19, and an RMSE(cv) of 1.49 on the out-of-bag test set. The descriptors selected are the count of protein atoms in a shell between 4.5 Å and 6 Å around each heavy ligand atom excluding oxygen and phosphorus, the count of sulfur atoms in the vicinity of tryptophan, and the count of aliphatic ligand hydroxy hydrogens. The first two descriptors have a positive sign, indicating that they contribute favorably to the binding energy, whereas the count of hydroxy hydrogens contributes unfavorably to the observed binding free energy. The fact that such a simple model can be so effective raises a couple of questions that are addressed in the article.
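    A minimal sketch of the bagging-with-out-of-bag-evaluation protocol mentioned above follows. It omits the stepwise descriptor selection and complexity-reduction steps and uses three synthetic descriptor columns in place of the atom-type pair counts, so the numbers it prints bear no relation to the reported statistics.

```python
# Illustrative sketch (assumptions, not the authors' code): bagged multiple
# linear regression with out-of-bag (OOB) evaluation on three descriptors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))                       # three synthetic descriptor columns
y = 1.0 * X[:, 0] + 0.6 * X[:, 1] - 0.8 * X[:, 2] + rng.normal(scale=1.0, size=n)

n_bags = 200
oob_sum = np.zeros(n)
oob_count = np.zeros(n)
for _ in range(n_bags):
    boot = rng.integers(0, n, size=n)             # bootstrap sample indices
    oob = np.setdiff1d(np.arange(n), boot)        # out-of-bag indices for this bag
    fit = LinearRegression().fit(X[boot], y[boot])
    oob_sum[oob] += fit.predict(X[oob])
    oob_count[oob] += 1

mask = oob_count > 0                              # samples that were OOB at least once
pred = oob_sum[mask] / oob_count[mask]            # averaged OOB predictions
resid = y[mask] - pred
r2 = 1.0 - np.sum(resid**2) / np.sum((y[mask] - y[mask].mean())**2)
print(f"OOB R^2={r2:.2f}  MUE={np.abs(resid).mean():.2f}  RMSE={np.sqrt((resid**2).mean()):.2f}")
```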

    Global Free Energy Scoring Functions based on distance-dependent Atom-type Pair Descriptors

    Scoring functions for protein-ligand docking have received much attention in the past two decades. In many cases, remarkable success has been demonstrated in predicting the correct geometry of interaction. On independent test sets, however, the predicted binding energies or scores correlate only weakly with the observed free energies of binding. In this study, we analyze how well free energies of binding can be predicted on the basis of crystal structures using traditional QSAR techniques in a proteochemometric approach. We introduce a new set of protein-ligand interaction descriptors based on distance-binned Crippen-like atom-type pairs. A subset of the publicly available PDBbind09-CN refined set (MW < 900 g/mol, #P < 2, ndon + nacc < 20; N = 1387) is used as the data set. It is demonstrated how simple, yet surprisingly good, scoring functions can be generated for the whole diverse database (R²(out-of-bag) = 0.48, R(p) = 0.69, RMSE = 1.44, MUE = 1.14) and for individual protein family subsets. This performance is significantly better than that of almost all other published scoring functions that have been validated on a test set as large and diverse as the PDBbind refined set. We also find that for some protein families surprisingly good scoring functions can be obtained using simple ligand-only descriptors such as logS, logP, and molecular weight. The ligand-descriptor-based scoring function equals or even outperforms commonly used scoring functions, highlighting the need for better scoring functions. We demonstrate how the observed performance depends on the validation strategy, and we outline a general validation protocol for future free energy scoring functions.
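    The descriptor idea can be sketched as follows, under simplifying assumptions: plain element symbols stand in for the Crippen-like atom types, the distance-bin edges are hypothetical, and structure-file parsing is omitted, so this illustrates distance-binned atom-type pair counting rather than the authors' implementation.

```python
# Sketch: count (protein atom type, ligand atom type, distance bin) occurrences
# for one protein-ligand complex and flatten them into a descriptor vector.
import numpy as np

ATOM_TYPES = ["C", "N", "O", "S", "P", "other"]            # coarse stand-in types
BINS = np.array([0.0, 2.5, 4.0, 6.0, 8.0, 12.0])           # hypothetical bin edges in Angstrom

def pair_descriptors(prot_xyz, prot_elem, lig_xyz, lig_elem):
    """Count protein-ligand atom-type pairs per distance bin."""
    type_index = {t: i for i, t in enumerate(ATOM_TYPES)}
    idx = lambda e: type_index.get(e, type_index["other"])

    counts = np.zeros((len(ATOM_TYPES), len(ATOM_TYPES), len(BINS) - 1))
    d = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    bin_id = np.digitize(d, BINS) - 1                       # -1 below first edge, len-1 beyond last
    for i, pe in enumerate(prot_elem):
        for j, le in enumerate(lig_elem):
            b = bin_id[i, j]
            if 0 <= b < len(BINS) - 1:
                counts[idx(pe), idx(le), b] += 1
    return counts.ravel()                                   # flat vector usable in a regression model

# Toy usage with random coordinates standing in for a crystal structure.
rng = np.random.default_rng(3)
desc = pair_descriptors(rng.uniform(0, 20, (50, 3)), ["C"] * 30 + ["N"] * 10 + ["O"] * 10,
                        rng.uniform(0, 20, (12, 3)), ["C"] * 8 + ["O"] * 3 + ["N"])
print(desc.shape, int(desc.sum()))
```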

    QSAR--how good is it in practice? Comparison of descriptor sets on an unbiased cross section of corporate data sets.

    The quality of QSAR (Quantitative Structure-Activity Relationships) predictions depends on a large number of factors, including the descriptor set, the statistical method, and the data sets used. Here we study the quality of QSAR predictions mainly as a function of the data set and descriptor type, using partial least squares as the statistical modeling method. The study makes use of the fact that we have access to a large number of data sets and to a variety of different QSAR descriptors. The main conclusions are that the quality of the predictions depends on both the data set and the descriptor used. The quality of the predictions correlates positively with the size of the data set and the range of biological activities. There is no clear dependence of the quality of the predictions on the complexity of the data set. All of the descriptors tested produced useful predictions for some of the data sets. None of the descriptors is best for all data sets; it is therefore necessary to test, for each individual data set, which descriptor produces the best model. In our tests, 2D fragment-based descriptors usually performed better than simpler descriptors based on augmented atom types. Possible reasons for these observations are discussed.
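    The kind of comparison described above can be sketched with partial least squares as follows. The two descriptor matrices and the activities are synthetic stand-ins (an assumption, not corporate data), and scikit-learn's PLSRegression is used for the PLS model; for a given data set, the descriptor set with the higher cross-validated q² would be preferred.

```python
# Minimal sketch: cross-validate the same PLS model on the same activities with
# two different descriptor sets and compare the resulting q^2 values.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n = 150
X_frag = rng.normal(size=(n, 200))                               # hypothetical 2D fragment counts
X_atom = X_frag[:, :20] + rng.normal(scale=2.0, size=(n, 20))    # cruder atom-type descriptors
y = X_frag[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=n)   # hypothetical activities

for name, X in [("fragment descriptors", X_frag), ("atom-type descriptors", X_atom)]:
    pls = PLSRegression(n_components=5)
    q2 = cross_val_score(pls, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: cross-validated q^2 = {q2:.2f}")
```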

    Developing collaborative QSAR models without sharing structures

    It is widely understood that QSAR models improve greatly if more data are used. However, irrespective of model quality, once chemical structures diverge too far from the initial data set, the predictive performance of a model degrades quickly. To increase the applicability domain, we need to increase the diversity of the training set. This can be achieved by combining data from diverse sources. In this contribution, we present a method for the collaborative development of linear regression models. The method differs from previous approaches because data are shared only in aggregated form. This prevents access to individual data points and therefore avoids the disclosure of confidential structural information. The final models are equivalent to models built on the combined data sets.
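    One concrete way to share data only in aggregated form, shown below as an illustration rather than necessarily the authors' exact scheme, is for each partner to contribute the matrices X^T X and X^T y computed on its own compounds; summing these and solving the normal equations yields the same least-squares coefficients as a fit on the pooled raw data, without exchanging any individual descriptor vectors or structures.

```python
# Sketch: collaborative linear regression from aggregated statistics only.
import numpy as np

rng = np.random.default_rng(5)
true_coef = np.array([1.5, -0.7, 0.3, 2.0])

def make_partner_data(n):
    """Synthetic stand-in for one partner's confidential descriptor/activity data."""
    X = rng.normal(size=(n, 4))
    y = X @ true_coef + rng.normal(scale=0.2, size=n)
    return X, y

partners = [make_partner_data(n) for n in (80, 120, 60)]

# Each partner computes and shares only aggregated quantities.
xtx = sum(X.T @ X for X, _ in partners)
xty = sum(X.T @ y for X, y in partners)
coef_shared = np.linalg.solve(xtx, xty)            # normal-equations solution

# Reference: the model fitted on the (hypothetically) pooled raw data.
X_all = np.vstack([X for X, _ in partners])
y_all = np.concatenate([y for _, y in partners])
coef_pooled, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

print(np.allclose(coef_shared, coef_pooled))       # True: the two models coincide
```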